A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments

Abstract

Bots are frequently used in GitHub repositories to automate repetitive activities that are part of the distributed software development process. They communicate with human actors through comments. While detecting their presence is important for many reasons, no large and representative ground-truth dataset is available, nor are classification models to detect and validate bots on the basis of such a dataset. This paper proposes a ground-truth dataset, based on a manual analysis with high interrater agreement, of pull request and issue comments in 5,000 distinct GitHub accounts, of which 527 have been identified as bots. Using this dataset, we propose an automated classification model to detect bots, taking as main features the number of empty and non-empty comments of each account, the number of comment patterns, and the inequality between comments within comment patterns. We obtained a very high weighted average precision, recall and F1-score of 0.98 on a test set containing 40% of the data. We integrated the classification model into an open source command-line tool to allow practitioners to detect which accounts in a given GitHub repository actually correspond to bots.
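The features named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that a "comment pattern" is a group of comments that are identical after lowercasing and whitespace normalization, and that inequality is measured with the Gini coefficient over the number of comments per pattern. The function names comment_features, normalize and gini are hypothetical.

```python
from collections import Counter


def gini(values):
    """Gini coefficient of non-negative counts (0.0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula on the sorted values.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


def normalize(text):
    """Assumption: comments that differ only in case or spacing share a pattern."""
    return " ".join(text.lower().split())


def comment_features(comments):
    """Compute per-account features from a list of comment strings."""
    empty = sum(1 for c in comments if not c.strip())
    non_empty = [normalize(c) for c in comments if c.strip()]
    patterns = Counter(non_empty)  # pattern -> number of comments in it
    return {
        "empty_comments": empty,
        "non_empty_comments": len(non_empty),
        "comment_patterns": len(patterns),
        "pattern_inequality": gini(patterns.values()),
    }


if __name__ == "__main__":
    # Hypothetical account with repetitive, template-like comments.
    sample = ["Build passed.", "build  passed.", "Build failed.", ""]
    print(comment_features(sample))
```

An account that posts many comments but only a handful of patterns is the kind of signal a classifier could learn from these features. A real pipeline would likely need a fuzzier notion of pattern than exact matching, since templated comments often embed varying values such as build numbers.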

Journal

Journal title: Journal of Systems and Software

Year: 2021

ISSN: 0164-1212, 1873-1228

DOI: https://doi.org/10.1016/j.jss.2021.110911